Multiple linear regression extends simple regression to several predictors, improving estimation and prediction of the response. The goal is to determine the effect (if any) of each predictor while controlling for the others.
\[ Y_i=\beta_0+\beta_1 X_{1,i}+\cdots+\beta_{p-1} X_{p-1,i}+\epsilon_i \Leftrightarrow Y_i=\sum_{k=0}^{p-1}\beta_k X_{k,i}+\epsilon_i,\quad X_{0,i}\equiv 1 \]
where \(\epsilon_i\stackrel{iid}{\sim} N(0,\sigma^2)\)
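As a quick illustration (simulated data, not from the notes; the coefficient values are made up), the model can be fit in R with `lm()`:

```r
# Illustrative sketch: simulate from the model above and fit with lm()
set.seed(1)
n  = 50
x1 = runif(n, 0, 10)
x2 = runif(n, 0, 10)
y  = 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)   # true beta = (2, 1.5, -0.8)
coef(lm(y ~ x1 + x2))                     # estimates should be close to the truth
```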
Polynomial and interaction terms are useful for capturing curvature or other nonlinearity in the relationship between the predictors and the response.
Example:
A regression with five predictor terms built from only two distinct predictors:
\[ Y_i=\beta_0+\beta_1 X_{1,i}+\beta_2 X_{1,i}^2+\beta_3 X_{2,i}+\beta_4 X_{1,i}X_{2,i}+\beta_5 X_{1,i}^2X_{2,i} +\epsilon_i \]
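One way to specify this model in R (a sketch with simulated data; the variable names and coefficient values are illustrative) uses `I()` to protect the squared term and `:` for the product terms:

```r
set.seed(2)
x1 = runif(40)
x2 = runif(40)
y  = 1 + x1 + 0.5 * x1^2 - x2 + 2 * x1 * x2 - x1^2 * x2 + rnorm(40, sd = 0.1)
# I() keeps ^2 as arithmetic; ':' forms the interaction (product) terms
fit5 = lm(y ~ x1 + I(x1^2) + x2 + x1:x2 + I(x1^2):x2)
coef(fit5)   # six coefficients: beta_0 through beta_5
```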
In a model with polynomial or interaction terms, coefficients must be interpreted with care: increasing one predictor also changes every term that involves it.
For example \[ E(Y|X_1,X_2)=\beta_0+\beta_1X_{1}+\beta_2X_{2}+\beta_3\underbrace{X_{1}X_{2}}_{X_3} \]
where a 1-unit increase in \(X_2\), i.e. \(X_2+1\), leads to
\[ E(Y|X_1,X_2+1)=E(Y|X_1,X_2)+\beta_2+\beta_3X_{1} \]
The effect of increasing \(X_2\) by 1 therefore depends on the level of \(X_1\).
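A small numeric check of this (illustrative coefficient values, not estimated from data):

```r
# E(Y) under the interaction model, with made-up coefficients
b0 = 1; b1 = 2; b2 = 3; b3 = -0.5
Ey = function(x1, x2) b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
# Effect of a 1-unit increase in X2 is b2 + b3 * X1:
Ey(4, 6)  - Ey(4, 5)    # b2 + b3 * 4  =  1
Ey(10, 6) - Ey(10, 5)   # b2 + b3 * 10 = -2
```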
dat = matrix(c(0, 4779, 0, 4706, 0, 4350, 20, 5189, 20, 5140,
20, 4976, 30, 5110, 30, 5685, 30, 5618, 40, 5995, 40, 5628,
40, 5897, 50, 5746, 50, 5719, 50, 5782, 60, 4895, 60, 5030,
60, 4648), 18, 2, byrow = TRUE, dimnames = list(c(), c("flyash",
"strength")))
plot(dat)
dat = as.data.frame(dat)
reg.lin = lm(strength ~ flyash, data = dat)
summary(reg.lin)
##
## Call:
## lm(formula = strength ~ flyash, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -901.62 -203.18 31.56 327.55 653.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4924.595 213.299 23.088 1.03e-13 ***
## flyash 10.417 5.507 1.891 0.0768 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.8 on 16 degrees of freedom
## Multiple R-squared: 0.1827, Adjusted R-squared: 0.1317
## F-statistic: 3.578 on 1 and 16 DF, p-value: 0.0768
abline(reg.lin)
# Add 2nd order polynomial term; storing it in dat keeps lm() from
# having to find flyash2 in the global environment
dat$flyash2 = dat$flyash^2
reg.2poly = lm(strength ~ flyash + flyash2, data = dat)
summary(reg.2poly)
##
## Call:
## lm(formula = strength ~ flyash + flyash2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -477.70 -214.01 27.04 287.87 390.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4486.3611 174.7531 25.673 8.25e-14 ***
## flyash 63.0052 12.3725 5.092 0.000132 ***
## flyash2 -0.8765 0.1966 -4.458 0.000460 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 312.1 on 15 degrees of freedom
## Multiple R-squared: 0.6485, Adjusted R-squared: 0.6016
## F-statistic: 13.84 on 2 and 15 DF, p-value: 0.0003933
curve(reg.2poly$coefficients[1] + reg.2poly$coefficients[2] *
x + reg.2poly$coefficients[3] * x^2, from = 0, to = 60, col = 2,
add = TRUE)
legend(0, 6000, legend = c("1st order", "2nd order"), lty = 1,
col = 1:2, bg = "light gray")
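A natural follow-up (not shown above) is a formal comparison of the two fits; with a single added term the partial F-test equals the squared t-statistic for `flyash2`. The `I()` form fits the same quadratic without creating `flyash2` by hand:

```r
anova(reg.lin, reg.2poly)   # partial F-test for the quadratic term
reg.2poly.alt = lm(strength ~ flyash + I(flyash^2), data = dat)
```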
\[\begin{align*} Y &= \mbox{ lost work hours }\\ X_1 &= \mbox{ number of employees }\\ X_2 &= \begin{cases} 1 & \mbox{safety program used}\\ 0 & \mbox{no safety program used} \end{cases} \end{align*}\]
\[ E(Y_i)=\beta_0+\beta_1 X_{1,i} + \beta_2 X_{2,i} \]
implies
\[ E(Y_i)=\begin{cases} (\beta_0+\beta_2)+\beta_1 X_{1,i} & \mbox{ if }X_2=1\\ \beta_0+\beta_1 X_{1,i} & \mbox{ if }X_2=0 \end{cases} \]
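How R encodes such a dummy variable can be seen from the design matrix (a sketch with made-up data):

```r
# A two-level factor enters the design matrix as a single 0/1 indicator column
toy = data.frame(x1 = c(100, 200, 300, 400),
                 x2 = factor(c(0, 1, 0, 1)))
model.matrix(~ x1 + x2, data = toy)
# columns: (Intercept), x1, x21 -- x21 is 1 exactly when x2 == 1
```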
dat = read.csv("https://raw.githubusercontent.com/lingxiaozhou/STA4210Rmaterial/main/data/safe_reg.csv",
header = TRUE)
reg1 = lm(y ~ x1 + x2, data = dat)
summary(reg1)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.923 -16.260 -2.347 13.562 67.300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.39945 9.90247 3.171 0.00305 **
## x1 0.01421 0.00140 10.148 3.07e-12 ***
## x2 -54.21033 7.24299 -7.485 6.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.87 on 37 degrees of freedom
## Multiple R-squared: 0.8154, Adjusted R-squared: 0.8054
## F-statistic: 81.73 on 2 and 37 DF, p-value: 2.658e-14
coef1 = coef(reg1)
plot(dat$x1, dat$y, pch = dat$x2 + 1, col = dat$x2 + 1)
abline(coef1["(Intercept)"], coef1["x1"], col = 1)
abline(coef1["(Intercept)"] + coef1["x2"], coef1["x1"], col = 2)
legend("bottomright", legend = c(0, 1), pch = 1:2, col = 1:2,
    title = "x2")
\[ E(Y_i)=\beta_0+\beta_1 X_{1,i} + \beta_2 X_{2,i} +\beta_3 (X_1 X_2)_i \]
which implies
\[ E(Y_i)=\begin{cases} (\beta_0+\beta_2)+(\beta_1+\beta_3) X_{1,i} & \mbox{ if }X_2=1\\ \beta_0+\beta_1 X_{1,i} & \mbox{ if }X_2=0 \end{cases} \]
reg2 = lm(y ~ x1 + x2 + x1:x2, data = dat) #same as lm(y~x1+x2+x1*x2) or lm(y~x1*x2)
summary(reg2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.407 -10.328 -2.760 8.563 65.534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.844082 10.127410 -0.182 0.857
## x1 0.019749 0.001546 12.777 6.11e-15 ***
## x2 10.725385 14.054508 0.763 0.450
## x1:x2 -0.010957 0.002174 -5.041 1.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.75 on 36 degrees of freedom
## Multiple R-squared: 0.8918, Adjusted R-squared: 0.8828
## F-statistic: 98.9 on 3 and 36 DF, p-value: < 2.2e-16
coef2 = coef(reg2)
plot(dat$x1, dat$y, pch = dat$x2 + 1, col = dat$x2 + 1)
abline(coef2["(Intercept)"], coef2["x1"], col = 1)
abline(coef2["(Intercept)"] + coef2["x2"], coef2["x1"] + coef2["x1:x2"],
col = 2)
legend("bottomright", legend = c(0, 1), pch = 1:2, col = 1:2,
    title = "x2")
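Whether the interaction term is needed can be checked with a partial F-test comparing the two fitted models (a natural follow-up, not shown above):

```r
anova(reg1, reg2)
# With a single added term, this F-test is equivalent to the
# t-test on x1:x2 in summary(reg2): F = t^2
```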